From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
نویسندگان
چکیده
Video captioning in essential is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using encoder-decoder framework for video captioning and address what we find to be a critical deficiency of the existing methods, that most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by the deterministic models. In this paper, we propose a generative approach, referred to as multi-modal stochastic RNNs networks (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning, and generate multiple sentences to describe a video considering different random factors. Specifically, a multimodal LSTM (M-LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM (S-LSTM) is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging datasets MSVD and MSR-VTT show that our proposed MS-RNN approach outperforms the state-of-the-art video captioning benchmarks.
منابع مشابه
A Deep Generative Model for Disentangled Representations of Sequential Data
We present a VAE architecture for encoding and generating high dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us ...
متن کاملStochastic Video Prediction with Conditional Density Estimation
Frame-to-frame stochasticity is a major challenge in video prediction. The use of standard feedforward and recurrent networks for video prediction leads to averaging of future states, which can in part be attributed to the networks’ limited ability to model stochasticity. We propose the use of conditional variational autoencoders (CVAE) for video prediction. To model multi-modal densities in fr...
متن کاملUsing a new modified harmony search algorithm to solve multi-objective reactive power dispatch in deterministic and stochastic models
The optimal reactive power dispatch (ORPD) is a very important problem aspect of power system planning and is a highly nonlinear, non-convex optimization problem because consist of both continuous and discrete control variables. Since the power system has inherent uncertainty, hereby, this paper presents both of the deterministic and stochastic models for ORPD problem in multi objective and sin...
متن کاملAutomated Image Captioning Using Nearest-Neighbors Approach Driven by Top-Object Detections
The significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Two broad paradigms have emerged in automated image captioning, i.e., generative model-based approaches and retrieval-based approaches. Although generative model-based approaches that use the r...
متن کاملSequence to Sequence Model for Video Captioning
Automatically generating video captions with natural language remains a challenge for both the field of nature language processing and computer vision. Recurrent Neural Networks (RNNs), which models sequence dynamics, has proved to be effective in visual interpretation. Based on a recent sequence to sequence model for video captioning, which is designed to learn the temporal structure of the se...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1708.02478 شماره
صفحات -
تاریخ انتشار 2017